CHAPTER I


INTRODUCTION 

Digitizing speech for transmission over a communications 
link between a manned spacecraft and a ground station is re- 
quired if the voice channel is to be integrated into a serial 
bit stream along with PCM (Pulse Code Modulation) telemetry 
data. If standard PCM techniques are used for digitizing 
the speech, the resulting bit rates can be high (about 40,000
bits per second). When the bandwidth of a communication link
must be minimized to stay within the available transmitter 
power allotment, steps must be taken to keep this bit rate 
as low as possible and still meet the minimum channel quality 
requirements. The two parameters that determine the serial 
bit rate, namely, the sample rate and the number of quantiza- 
tion levels, can be reduced until sampling errors and quanti- 
zation errors generate distortion products (noise) in sufficient 
quantity to render the decoded speech unacceptable. 

Linear digital encoding is used by the Bell Telephone
Company's digital voice communication links (Kleiner, 1969).
The analog speech is sampled at 8000 samples per second.
Quantization noise is kept very low by quantizing with 32
levels (5 bits per sample). This results in a very good
communication link but requires a wide transmission bandwidth.
In Japan, many of the telephone channels are digitized. Nonlinear
encoding techniques are used to keep the overall channel
voice quality high with low bit rates (Taki, 1966).

The purpose of this thesis is to analyze the cause and 
develop techniques to minimize the effects of quantization 
noise in a digitized speech channel. This is accomplished 
in five chapters covering the waveform characteristics of 
speech, the process of digitizing speech, the sources of 
quantization noise, the causes of speech degradation by 
quantization noise, and the techniques to minimize quantization 
noise.

The characteristics of speech are analyzed in Chapter II. 
Human utterances can be divided into vowel and consonant 
sounds. Vowel sounds contain the most power and are formed 
by vibrating the vocal cords and manipulating an open vocal 
tract. Vowel sounds are essentially periodically damped 
sine waves. Consonant sounds are formed by constrictions 
in the vocal tract. Consonant waveforms are very complex 
and many times look like white noise in amplitude-versus-time 
plots.

Analog-to-digital conversion using PCM techniques is 
explained in Chapter III by using simple waveforms such as 
sine waves. The same techniques are then applied for digi- 
tizing speech. The speech waveforms are used instead of 
sine waves to show how speech is digitized to achieve the 
maximum intelligibility for a minimum resulting bit rate.
Optimum bandpass filters, sample rates, and quantization 
levels will be derived and discussed. 

The optimum digital channel can be developed by mini- 
mizing the sample rate and the number of quantization levels 
until sampling errors and quantization noise degrade the 
channel performance below the minimum acceptable level. The 
source of sampling errors and quantization noise is analyzed 
in Chapter IV. The theoretical amount of quantization noise 
for several quantization levels is compared to laboratory 
experimental results. Optimum theoretical parameters for a 
low-bit-rate digital voice channel are also derived and com- 
pared with experimental results. 

Techniques for minimizing the effects of quantization 
noise will be discussed in Chapter V. One of the successful 
methods of reducing the amount of quantization noise at low 
bit rates is to employ nonlinear quantization techniques. 
Greater weighting is given to the high frequency, low amplitude
speech components. This is achieved either through logarithmic
compression prior to analog-to-digital conversion or through a
nonlinear ladder network in the analog-to-digital converter.
The converse processing is carried out during or following the
digital-to-analog conversion process to restore the original
amplitude versus frequency relationships of the input speech.
The results of laboratory tests are presented and discussed for
an analog compression and expansion system that was used with a
linear PCM system for speech.

CHAPTER II


WAVEFORM CHARACTERISTICS OF SPEECH 

Spoken words and sentences are a sequence of sounds that
have little meaning individually; however, when one of these
sounds replaces another in an utterance, the meaning is
changed. These basic sounds are called phonemes. Human
speech is a moderately complicated single-valued (amplitude)
function of time composed of smooth changes from one fairly
stable sound (phoneme) to another. Each phoneme is produced
by modulating either the radiated sound from the vocal cords
(voiced sounds) or the sound produced by air forced through
a constriction formed by the tongue, teeth, or lips (unvoiced
sounds).

The vowel sounds or voiced sounds are produced by the 
vocal cord excitation of the vocal tract. The vocal tract, 
consisting of the lungs, trachea, larynx, mouth, and nasal 
passages, maintains a relatively stable configuration for 
the 12 vowel sounds in the English language. The position 
of the tongue with respect to the roof of the mouth changes 
the resonant frequency of the tract and consequently the 
harmonic components of the voiced sound. Each voiced sound 
corresponds to its own combination of harmonics or "formants." 
The fundamental harmonic is the pitch frequency or rate of
vibration of the vocal cords. The pitch frequency normally
lies between 80 and 200 Hz for men and between 150 and 350 Hz
for women (Flanagan, 1965). The position of the tongue in the
vocal tract determines which harmonics of the pitch frequency
are reinforced or suppressed. Each vowel sound is
characterized by its combination of harmonics (formants).
Figure 2-1 shows the amplitude versus frequency distribution
of eight of the twelve vowel sounds. Note that each sound
can be characterized by four or fewer harmonics (formants)
of the pitch frequency. The major power in many vowel sounds
is centered around the third formant (400 - 800 Hz). More
than eighty percent of the power in the vowel sounds is
concentrated below 1500 Hz (Fletcher, 1961).

The consonant sounds are much more complex than the
vowel sounds. These sounds are formed either by constrictions
in the vocal tract which cause air turbulence (unvoiced)
or by a combination of constrictions and vocal tract
excitation (voiced). The intelligence bearing portion of consonant
sounds is contained in the frequency range of 300 to 5000 Hz.
The amplitude versus frequency spectra of some consonant
sounds, shown in Fig. 2-2, indicate that the major amplitude
concentrations are between 2000 and 3000 Hz.

CHAPTER III


SPEECH DIGITIZATION 

A voice channel is degraded when the minimum output S/N 
requirements are not met. One solution to this problem is
to encode the speech amplitude information using PCM encoding 
techniques. Encoded speech in the form of a binary bit 
stream can be time-division multiplexed with other digitized 
data bit streams. During the transmission, or following 
detection, the encoded bit stream can also be regenerated to 
preserve the original signal-to-noise relationship. 

One of the first steps in developing a digital voice 
communication system is to define the voice channel perfor- 
mance requirements. If a voice channel is to provide suffi- 
cient quality to allow the perception of speaker identity 
and emotional status in addition to all utterances, the 
channel must have a wide frequency response (100 to 10,000 
Hz) and a high speech-to-noise power ratio (in excess of 
30 db) to reproduce all of the waveform components of vowel 
and consonant sounds. However, if the correct perception 
of all spoken words is the prime requirement, the required 
channel bandwidth can be much less. The listener can 
correctly interpret all sentence information, based on sen- 
tence and phonetic context, without using all of the higher 
frequency consonant components. Perception of key words in
sentences defines the complete message since some words may
be redundant or assumed by context. Perception of the key
harmonics of vowel and consonant sounds is sufficient to recognize
the complete syllable or monosyllabic word. Laboratory tests
have shown that no degradation of channel word intelligibility
results if speech is passed through a low-pass filter with a
3 db cutoff of 2.5 kHz and a frequency rolloff of 36 db per
octave (Schmidt, 1968). A word intelligibility degradation
of only 5 percent was measurable when the 3 db cutoff was
reduced to 1.5 kHz. Since the objective of this work is to
develop an intelligible digital voice channel with a low bit
rate, it is assumed that a minimum bandwidth analog speech
spectrum is desired.

The first step in encoding an analog voice signal is to
periodically sample it. Shannon's sampling theorem states
that a time function $f(t)$ containing no frequency
components higher than $W$ Hz is completely represented by
samples taken $1/2W$ seconds apart, and can be expressed as
(Taki, 1966)

$$f(t) = \sum_{n=-\infty}^{\infty} f\!\left(\frac{n}{2W}\right) \frac{\sin \pi(2Wt - n)}{\pi(2Wt - n)} \qquad (3\text{-}1)$$

where $f(n/2W)$ denotes the values of $f(t)$ at time intervals
spaced $1/2W$ seconds apart. The function $f(t)$ is uniquely
determined by this set of sampled values. For instance, if
a speech signal is filtered such that no components greater
than 2500 Hz are present, the minimum sample rate would be
5000 samples per second. In a conventional analog-to-digital
converter, the periodic samples of the input signal (Fig. 3-1(a))
form a PAM (Pulse Amplitude Modulated) waveform (Fig. 3-1(b)).
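The reconstruction described by Eq. (3-1) can be tried numerically. The following is a minimal sketch, not part of the original work: the two-tone signal, the truncation length N, and the function names are assumptions chosen only for illustration. Because the infinite sum is truncated to a finite number of samples, the reconstructed value agrees with the true value only approximately.

```python
import math

def sinc_reconstruct(samples, first_index, W, t):
    """Evaluate Eq. (3-1) at time t from samples f(n/2W), n = first_index, first_index + 1, ..."""
    total = 0.0
    for k, value in enumerate(samples):
        n = first_index + k
        x = math.pi * (2.0 * W * t - n)
        # sin(x)/x tends to 1 as x approaches 0
        kernel = 1.0 if abs(x) < 1e-12 else math.sin(x) / x
        total += value * kernel
    return total

def band_limited_signal(t):
    # Hypothetical input: components at 1000 Hz and 2000 Hz, both below W = 2500 Hz
    return math.sin(2 * math.pi * 1000 * t) + 0.5 * math.cos(2 * math.pi * 2000 * t)

W = 2500.0   # highest frequency present, Hz, so the sample rate is 2W = 5000 per second
N = 2000     # samples kept on each side of t = 0 (the series is truncated here)
samples = [band_limited_signal(n / (2 * W)) for n in range(-N, N + 1)]

t = 0.000137  # an instant that falls between two sample times
approx = sinc_reconstruct(samples, -N, W, t)
print(approx, band_limited_signal(t))
```

Keeping more samples (a larger N) brings the two printed values closer together, at the cost of a longer sum.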

To convert the PAM signal into a PCM signal, the pulse
amplitudes are binary encoded. The maximum input voltage range A
is divided into equal discrete amplitude levels b volts
apart. This is called the quantization process. The selection
of the number of quantization levels determines how
accurately the sample amplitudes are encoded. For illustration
purposes it is assumed that eight levels are adequate
for speech. A function representing the quantized waveform
is shown in Fig. 3-2(b). Note that all voltages within
± b/2 volts of a particular level are referred to that
level. This approximation of the input signal by rounding
off may result in a decoded signal that differs from the
input signal. This quantization error is evident as noise
in the decoder output. Each sample amplitude is represented
by a binary word containing n bits for the $2^n$ quantization
levels used in the analog-to-digital conversion process.
The binary words are then transmitted serially at a bit rate
equal to the sample rate times the number of bits in each
word. When the encoded bit stream reaches its destination,
it is converted back to an analog speech waveform by a DAC
(digital-to-analog converter).
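The quantizing, encoding, and decoding steps above can be sketched in a few lines. This is an illustrative sketch only: the function names, the 2-volt input range, and the choice of placing each decoded level at the centre of its cell are assumptions, not details taken from the hardware described in this work.

```python
import math

def pcm_encode(sample, full_range, bits):
    """Map a sample in [-full_range/2, +full_range/2] to an n-bit code (0 .. 2**bits - 1)."""
    levels = 2 ** bits
    step = full_range / levels                     # spacing b between adjacent levels
    index = math.floor((sample + full_range / 2) / step)
    return max(0, min(levels - 1, index))          # clip samples at the range limits

def pcm_decode(code, full_range, bits):
    """Return the level voltage at the centre of the cell identified by the code."""
    step = full_range / 2 ** bits
    return -full_range / 2 + (code + 0.5) * step

def bit_rate(sample_rate, bits):
    """Serial bit rate = sample rate times bits per word."""
    return sample_rate * bits

code = pcm_encode(0.3, 2.0, 3)        # eight levels, as in the illustration above
print(code, pcm_decode(code, 2.0, 3))
print(bit_rate(8000, 5))              # the 8000 samples per second, 5-bit case of Chapter I
```

The decoded voltage differs from the input by at most half a step, which is the quantization error discussed in Chapter IV.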
CHAPTER IV


QUANTIZATION NOISE 

The minimum acceptable quantization noise at the digital 
voice channel's decoder output determines the minimum number 
of quantization levels required to encode each amplitude sam- 
ple. To determine the amount of quantization noise to be 
expected, an expression for the output speech-to-noise ratio
as a function of the number of quantization levels is 
developed. 

Let the signal into the ADC (analog-to-digital converter)
be quantized into s levels, with a spacing of b volts
between adjacent levels (see Fig. 4-1(a)). It is assumed
that the input signal is referenced to zero volts (no dc
component). The maximum voltage excursion P is divided
into s equal spacings of b volts such that

$$b = \frac{P}{s} \qquad (4\text{-}1)$$

The quantized amplitudes are $\pm b/2, \pm 3b/2, \ldots, \pm (s-1)b/2$,
and the quantized samples cover a range

$$A = (s-1)b \text{ volts} \qquad (4\text{-}2)$$

In Chapter III it was stated that the quantization process
introduces an uncertainty error since a sample appearing at
the DAC output could have been due to any input signal in
the range $A_j - b/2$ to $A_j + b/2$ volts. To calculate the
mean-squared error voltage, it is assumed that over a long
period of time all voltage values in the region of uncertainty
eventually appear the same number of times. The
instantaneous signal amplitude is $A_j + \epsilon$, with
$-b/2 \le \epsilon \le b/2$, where $\epsilon$ represents the error voltage
between the instantaneous (actual) signal and its quantized
equivalent (Schwartz, 1959). For all values of $\epsilon$ equally
likely, the mean-squared value of $\epsilon$ is

$$\overline{\epsilon^2} = \frac{1}{b}\int_{-b/2}^{b/2} \epsilon^2 \, d\epsilon = \frac{b^2}{12} \qquad (4\text{-}3)$$
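The result of Eq. (4-3) can be checked numerically by quantizing a slow ramp, which visits every voltage in the range equally often. The sketch below is an illustration under assumed values (a 2-volt excursion, 16 levels, and hypothetical function names); it is not the laboratory procedure used in this work.

```python
import math

def quantize(v, b):
    """Round v to the nearest level of a quantizer with levels at +-b/2, +-3b/2, ..."""
    return (math.floor(v / b) + 0.5) * b

def mean_squared_error(P, s, points_per_step=1000):
    """Average squared quantization error for a ramp spanning the excursion P with s levels."""
    b = P / s                                  # Eq. (4-1)
    n = s * points_per_step
    total = 0.0
    for k in range(n):
        v = -P / 2 + (k + 0.5) * P / n         # ramp sampled at evenly spaced points
        e = v - quantize(v, b)
        total += e * e
    return total / n

P, s = 2.0, 16
b = P / s
print(mean_squared_error(P, s), b * b / 12)    # the two values should nearly coincide
```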


The average error is zero and the rms error or output noise
is $b/\sqrt{12} = b/(2\sqrt{3})$ volts. The maximum DAC output
signal-to-noise ratio (assuming an infinite S/N ratio at the input
of the ADC) in terms of peak signal voltage to rms noise
voltage is given by

$$\frac{S_{ov}}{N_{ov}} = \frac{P}{b/(2\sqrt{3})} = 2\sqrt{3}\, s \qquad (4\text{-}4)$$

the corresponding power ratio is

$$\frac{S_o}{N_o} = 12 s^2 \qquad (4\text{-}5)$$

or in decibels

$$\left(\frac{S_o}{N_o}\right)_{db} = 10.8 + 20 \log_{10} s \qquad (4\text{-}6)$$

Peak speech power is approximately 14 db greater than rms
speech power (Fletcher, 1961). Therefore the maximum rms
speech to rms noise power ratio can be expressed as

$$\left(\frac{S}{N}\right)_{db} = 20 \log_{10} s - 3.2 \qquad (4\text{-}7)$$

or 

Therefore the power ratio increases as the square of the 
number of quantization levels. The maximum output S/N ratio 
(rms to rms) for several different quantization levels is
given in Fig. 4-2. 
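As a worked illustration of Eqs. (4-6) and (4-7), the short sketch below tabulates the maximum output S/N for several numbers of levels. The function names are hypothetical, and the 14 db peak-to-rms figure is the one quoted above from Fletcher (1961).

```python
import math

def peak_snr_db(s):
    """Eq. (4-6): peak signal to rms noise power ratio in db for s quantization levels."""
    return 10 * math.log10(12) + 20 * math.log10(s)

def rms_snr_db(s, peak_to_rms_db=14.0):
    """Eq. (4-7): rms speech to rms noise, taking peak speech power 14 db above rms."""
    return peak_snr_db(s) - peak_to_rms_db

for s in (4, 8, 16, 32, 64):
    print(s, round(rms_snr_db(s), 1))
```

Each doubling of the number of levels (one more bit per sample) adds about 6 db, so 8 levels give roughly 15 db and 16 levels roughly 21 db.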

The laboratory experimental data (Fig. 4-3) shows that 
Eq. (4-7) is valid for obtaining an approximate maximum DAC 
output S/N ratio (Culver, 1969). It should also be noted
that the output S/N is always less than the input S/N. 

The power spectral distribution of quantization noise 
depends on the number of quantization levels and interaction 
between the input signal and the sampling function. Figure 4-4 
shows the PSD (power spectral distribution) of the output of 
a DAC for a 500 Hz sinewave that was sampled at 2000 Hz and 
quantized using 16 levels. The harmonics shown occur at
$f_s \pm f_m$, $2f_s \pm f_m$ and $3f_s - f_m$, where $f_s$ is the sampling
frequency and $f_m$ is the frequency of the input signal.
To investigate this phenomenon in more detail, a series of
PSD's were made for an input signal of 2000 Hz sampled at
4150 Hz and quantized using 32, 16, 8 and 4 levels (see
Figs. 4-5, 4-6, 4-7 and 4-8). The 4150 Hz sample rate was
chosen to ensure that the sum and difference harmonics would
not be confused with the even-ordered harmonics of $f_m$ or
$f_s$. For 32 and 16 quantization levels, the only significant
components were at $f_m$, $f_s \pm f_m$, and $2f_s - f_m$.
However, when 8 and 4 quantization levels were used, many
more harmonics became significant.

Fig. 4-3. Input S/N Versus Output S/N

[Fig. 4-4: PSD of DAC output, 500 Hz sinewave sampled at 2000 Hz
and quantized with 16 levels]
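The positions of the sum and difference components can be listed directly from $f_s$ and $f_m$. The helper below is a hypothetical illustration (its name and the frequency limit argument are assumptions); it reports only where components of the form $hf_s \pm f_m$ fall, not their amplitudes.

```python
def sum_and_difference(fs, fm, max_order, f_limit):
    """List the frequencies |h*fs +- fm| for h = 1..max_order that do not exceed f_limit (Hz)."""
    found = set()
    for h in range(1, max_order + 1):
        for f in (h * fs - fm, h * fs + fm):
            if 0 < abs(f) <= f_limit:
                found.add(abs(f))
    return sorted(found)

# 2000 Hz input sampled at 4150 Hz, as in the PSD series described above
print(sum_and_difference(4150, 2000, 3, 10000))   # [2150, 6150, 6300]
# 500 Hz input sampled at 2000 Hz
print(sum_and_difference(2000, 500, 3, 6000))     # [1500, 2500, 3500, 4500, 5500]
```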

The origin of all of the major harmonics appearing at 
the output of the DAC can be explained using Fourier analysis. 
The waveforms in the ADC and DAC can be represented by their 
Fourier series 

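$$f(t) = \frac{1}{T} \sum_{h=-\infty}^{\infty} C_h e^{jh\omega_0 t} \qquad (4\text{-}8)$$

$$C_h = \int_{-T/2}^{T/2} f(t)\, e^{-jh\omega_0 t}\, dt \qquad (4\text{-}9)$$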


Fig. 4-6. 2000 Hz Sinewave Sampled at 4150 Hz and Quantized
with 16 Levels

Fig. 4-7. 2000 Hz Sinewave Sampled at 4150 Hz and Quantized
with 8 Levels

Fig. 4-8. 2000 Hz Sinewave Sampled at 4150 Hz and Quantized
with 4 Levels

Fig. 4-9. ADC Sampling Function and DAC Approximating
Function

The sampling function $S(t)$ is a periodic pulse waveform
with a period $T$ and a pulse width $\Delta t$ (see Fig. 4-9(a)).
The magnitudes of the harmonic components of the sampling
function are given by

$$C_h = \int_{t_1}^{t_2} E\, e^{-jh\omega_0 t}\, dt \qquad (4\text{-}10)$$

where $\omega_0 = 2\pi/T$, so that

$$C_h = \frac{E}{-jh\omega_0}\left[e^{-jh\omega_0 t_2} - e^{-jh\omega_0 t_1}\right] \qquad (4\text{-}11)$$

For $t_2 - t_1 = \Delta t$ or $t_2 = t_1 + \Delta t$, we have

$$C_h = \frac{E}{jh\omega_0}\, e^{-jh\omega_0 t_1}\left(1 - e^{-jh\omega_0 \Delta t}\right) \qquad (4\text{-}12)$$

$$C_h = E \Delta t\, \frac{\sin(h\omega_0 \Delta t/2)}{h\omega_0 \Delta t/2}\, e^{-j(h\omega_0 t_1 + h\omega_0 \Delta t/2)} \qquad (4\text{-}13)$$

or

$$f(t) = \frac{E \Delta t}{T} \sum_{h=-\infty}^{\infty} \frac{\sin(h\omega_0 \Delta t/2)}{h\omega_0 \Delta t/2}\, e^{j(h\omega_0 t - \phi)} \qquad (4\text{-}14)$$

where $\phi = h\omega_0 (t_1 + \Delta t/2)$.


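If a sinusoidal signal $I(t) = A \cos \omega_m t$ is sampled by our
sampling function in the ADC, the $C_h$ term in the Fourier
series of the output waveform $B(t)$ (see Fig. 4-9(b)) is
given by

$$C_h = \int_{t_1}^{t_2} E A \cos \omega_m t \; e^{-jh\omega_s t}\, dt$$

for $\omega_0 = \omega_s$ = sample rate in radians/second. With the
pulse centered on $t = 0$ ($t_1 = -\Delta t/2$, $t_2 = \Delta t/2$),

$$C_h = E A \left[ \frac{e^{j(h\omega_s - \omega_m)\Delta t/2} - e^{-j(h\omega_s - \omega_m)\Delta t/2}}{2j(h\omega_s - \omega_m)} + \frac{e^{j(h\omega_s + \omega_m)\Delta t/2} - e^{-j(h\omega_s + \omega_m)\Delta t/2}}{2j(h\omega_s + \omega_m)} \right]$$

$$C_h = E A \frac{\Delta t}{2} \left[ \frac{\sin (h\omega_s - \omega_m)\Delta t/2}{(h\omega_s - \omega_m)\Delta t/2} + \frac{\sin (h\omega_s + \omega_m)\Delta t/2}{(h\omega_s + \omega_m)\Delta t/2} \right] \qquad (4\text{-}15)$$

$$B(t) = I(t) S(t) = \frac{E A \Delta t}{2T} \sum_{h=-\infty}^{\infty} \left[ \frac{\sin (h\omega_s - \omega_m)\Delta t/2}{(h\omega_s - \omega_m)\Delta t/2} + \frac{\sin (h\omega_s + \omega_m)\Delta t/2}{(h\omega_s + \omega_m)\Delta t/2} \right] e^{jh\omega_s t} \qquad (4\text{-}16)$$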
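The closed form of Eq. (4-15) can be compared with a direct numerical evaluation of the defining integral. The sketch below assumes a sampling pulse centred on $t = 0$, unit values for E and A, and a pulse width of one tenth of the sampling period; these values and the function names are illustrative assumptions only.

```python
import cmath
import math

def sinc(x):
    """sin(x)/x with the limiting value 1 at x = 0."""
    return 1.0 if abs(x) < 1e-12 else math.sin(x) / x

def c_h_closed_form(E, A, dt, ws, wm, h):
    """Eq. (4-15) for a sampling pulse of width dt centred on t = 0."""
    return E * A * dt / 2 * (sinc((h * ws - wm) * dt / 2) + sinc((h * ws + wm) * dt / 2))

def c_h_numeric(E, A, dt, ws, wm, h, steps=20000):
    """Midpoint-rule integral of E*A*cos(wm*t)*exp(-j*h*ws*t) over the pulse width."""
    width = dt / steps
    total = 0j
    for k in range(steps):
        t = -dt / 2 + (k + 0.5) * width
        total += E * A * math.cos(wm * t) * cmath.exp(-1j * h * ws * t)
    return total * width

fs, fm = 4150.0, 2000.0
ws, wm = 2 * math.pi * fs, 2 * math.pi * fm
dt = 0.1 / fs                       # assumed pulse width: one tenth of the sampling period
closed = c_h_closed_form(1.0, 1.0, dt, ws, wm, 1)
numeric = c_h_numeric(1.0, 1.0, dt, ws, wm, 1)
print(closed, numeric)
```

The imaginary part of the numerical result is negligible because the pulse is symmetric about the origin, which is why Eq. (4-15) carries no phase term.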


The digital-to-analog conversion process in the DAC
approximates the ADC input signal $I(t)$ by generating a staircase
approximation function $Z(t)$ with step amplitudes corresponding
to the quantized sample amplitudes of the ADC waveform
$B(t)$ (see Fig. 4-9(c)). Therefore $Z(t)$ will have
frequency components similar to those found in $B(t)$ since
the two waveforms differ only in duty cycle. $Z(t)$ can be
represented by the Fourier series for $B(t)$ where the pulse
width is increased to the limit of $\Delta t = T$. Therefore $Z(t)$
is of the form

$$Z(t) = C \sum_{h=-\infty}^{\infty} \left\{ \frac{\sin[2\pi(hf_s - f_m)t]}{2\pi(hf_s - f_m)t} + \frac{\sin[2\pi(hf_s + f_m)t]}{2\pi(hf_s + f_m)t} \right\} e^{j2\pi h f_s t} \qquad (4\text{-}17)$$

where $f_s$ = the sampling frequency, and
$f_m$ = the frequency of the input signal.

The amplitudes of the harmonic components at the sum and
difference frequencies of $f_s$ and $f_m$ depend on the number
of quantization levels used in the analog-to-digital conversion
process. These components are also affected by the
phase relationships between the sampling function and the
input signal. Figure 4-10(a) shows the 4-level staircase
DAC approximation function for a 1000 Hz sinewave sampled by
a 3000 pps (pulses per second) sampling function that was
synchronized with the input signal. The input sinewave was
sampled in the same place for every consecutive cycle. The
PSD plot (Fig. 4-10(c)) shows only the difference frequency
$f_s - f_m = 2000$ Hz and $f_m = 1000$ Hz. When this same
input signal, $f_m = 1000$ Hz, is sampled asynchronously by
a 3050 pps sampling function, consecutive DAC approximation
waveforms are not the same for every cycle. When the sampled
signal and the sampling function are asynchronous, much
quantization noise is generated (see Fig. 4-10(c)). The
locus formed by the corners of the DAC approximation function
when it is not synchronized to the input sinewave
(normal case) produces the waveform shown in Fig. 4-11 when
viewed on an oscilloscope. When the rate of sampling
increases, $\Delta t$ decreases and the error area decreases. When
the number of quantization levels increases, the amount of
uncertainty (or error) in the amplitudes of the steps of
the DAC approximation function decreases. The unwanted
harmonics generated by the quantization errors do not have
a flat amplitude versus frequency distribution. As can be
seen in Figs. 4-7, 4-8 and 4-10, the harmonic amplitudes
increase as their frequencies approach that of the input
signal or the sum and difference frequencies $hf_s \pm f_m$.

Fig. 4-10. Sampling Function Synchronous and
Asynchronous with ADC Input

Fig. 4-11. Oscilloscope Waveform at Sampling
Function Asynchronous with ADC Input
(DAC Output)

Fig. 4-12. PSD of DAC Output for Speech Sampled at 4000 Samples
per Second. Phrase, "Top Dog, it's necessary to show you have
heard wasp," Quantized with 64 levels

Much of the analysis of the sinewave outputs from a
DAC can now be applied toward the analysis of speech spectra.
Speech is composed of many sinusoidal components. The vowel
sounds are very sinusoidal in nature whereas the consonant
sounds are more similar to white noise. Figure 4-12 shows
the PSD of the phrase, "Top dog, it's necessary to show you
have heard wasp," recorded at the output of the DAC. The
input speech was sampled at 4000 samples per second and 64
quantization levels were used to digitize it. Note the high
energy content of the vowel formants centered around 500 Hz.
This speech was very intelligible. Tapes prepared for
quantitative evaluation provided an average WI (word
intelligibility) of 89 percent (see Appendix A for additional
information on word intelligibility scoring and the
significance of percent WI scores). If speech is sampled
at rates less than 4000 samples per second, the first
difference components ($f_s - f_m$) of the vowel and consonant
sounds are folded back on the input consonant and vowel
sounds. These "harmonic distortion" products mask the desired
speech components and seriously degrade the speech
intelligibility. As the sample rate is decreased below
4000 samples per second, the consonant sounds are affected
first. At a sample rate of 3000 samples per second, the
consonant sounds from 1500 to 2000 Hz are being folded back
on themselves. At a sample rate of 2500 samples per second
(Fig. 4-13), the high energy vowel sounds are being superimposed
on the low energy consonant sounds between 1500 and
2000 Hz, causing a serious degradation of the speech
intelligibility (WI = 69%). Therefore, the speech should be
sampled at a minimum rate of 4000 samples per second and
filtered such that no speech components with frequencies
greater than 2000 Hz are sampled. Ideal lowpass filters are
not realizable, so a compromise selection of the maximum
upper frequency components (hence, the minimum sample rate)
must be made.
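The fold-back of the first difference components can be illustrated with simple band arithmetic. In the sketch below the band edges (vowel energy taken as 500 to 1000 Hz, speech limited to 2000 Hz) and the function names are assumptions made only for the example.

```python
def first_difference_band(fs, f_low, f_high):
    """Band occupied by the first difference components fs - fm for fm in [f_low, f_high]."""
    return (fs - f_high, fs - f_low)

def overlap(band_a, band_b):
    """Overlapping part of two (low, high) bands, or None if they do not overlap."""
    low = max(band_a[0], band_b[0])
    high = min(band_a[1], band_b[1])
    return (low, high) if low < high else None

speech = (0, 2000)
# At 4000 samples per second the difference band starts where the speech band ends.
print(overlap(first_difference_band(4000, *speech), speech))      # None
# At 2500 samples per second vowel energy at 500-1000 Hz lands on 1500-2000 Hz.
print(first_difference_band(2500, 500, 1000))                     # (1500, 2000)
```

The second result matches the 1500 to 2000 Hz region in which the vowel components are said to mask the consonant sounds at 2500 samples per second.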

Fig. 4-13. PSD of DAC Output for Speech Sampled at 2500 Samples
per Second. Phrase, "Top Dog, it's necessary to show you have
heard wasp," quantized with 64 levels

Assuming that the power spectrum for the consonant
sounds is flat, the spectrum of the "knee" of the lowpass
filter output is given by

$$G^2(f) = \frac{A}{1 + (f/f_0)^{2m}} \qquad (4\text{-}18)$$

where $A$ = the power density of the flat portion of the
spectrum,
$f_0$ = the half power frequency, and
$m$ = the rate of spectrum cutoff, i.e., $m = 1$ corresponds
to 6 db per octave and $m = 2$ corresponds to 12 db
per octave (see Fig. 4-14(a)).

All of the difference components $hf_s - f_m$ are bounded by
the frequency $f_s/2$ and the curve for $m = 1$. For frequencies
greater than $f_0$, the spectrum is approximately

$$G^2(f) \approx A \left(\frac{f_0}{f}\right)^{2m} \qquad (4\text{-}19)$$

and the energy in the difference components is

$$V_d^2 = \int_{f_s/2}^{\infty} G^2(f)\, df = \int_{f_s/2}^{\infty} A \left(\frac{f_0}{f}\right)^{2m} df = \frac{2^{2m-1} A f_0}{2m-1} \left(\frac{f_0}{f_s}\right)^{2m-1} \qquad (4\text{-}20)$$
Fig. 4-14. Optimizing the Lowpass Filter Cutoff Frequency

To compare the energy of the difference components with
that of the total speech power, we must first determine
the amount of power in the total speech spectrum. A
conservative estimate of the total speech power may be obtained
by assuming that the speech spectrum is flat with the maximum
energy versus frequency distribution equal to that of
the consonant sounds. The total speech spectrum power is
given by

$$V_s^2 = \int_0^{\infty} G^2(f)\, df = \int_0^{\infty} \frac{A}{1 + (f/f_0)^{2m}}\, df = \frac{\pi A f_0}{2m} \operatorname{cosec} \frac{\pi}{2m} \qquad (4\text{-}21)$$

The percentage of harmonic distortion caused by the frequency
components greater than $f_s/2$ is given by



(4-22) 


This quantity is plotted for five values of $m$ in Fig. 4-14(b).
Lowpass filters with an amplitude versus frequency rolloff of
36 db per octave above the cutoff frequency represent a practical
limit for audio circuits. Filters with steeper rolloff
characteristics become bulky and expensive. With a 36 db per
octave filter and a half-power frequency $f_0$ of 2000 Hz, the
minimum ratio of sample rate $f_s$ to $f_0$ is approximately 2.5
to keep the harmonic distortion below 10 percent. The
maximum value of 10 percent distortion was chosen since this
is the level of distortion where the unwanted harmonics are
not only noticeable but are beginning to degrade the channel's
quality (Schmidt, 1968). Therefore, the minimum sampling
frequency is 5000 samples per second for a bandpass filter
with a 3 db cutoff frequency of 2000 Hz and a rolloff of
36 db per octave.
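The trade between filter rolloff and sample rate can be explored with the closed forms of Eqs. (4-20) and (4-21). The sketch below is hedged in one important respect: the exact form of Eq. (4-22) is not legible here, so the distortion measure is ASSUMED to be the rms ratio, the square root of the difference-component power over the total power. The function names are likewise hypothetical.

```python
import math

def difference_power(m, f0, fs, A=1.0):
    """Eq. (4-20): power above fs/2 using the asymptotic spectrum A*(f0/f)**(2m)."""
    return 2 ** (2 * m - 1) * A * f0 / (2 * m - 1) * (f0 / fs) ** (2 * m - 1)

def total_power(m, f0, A=1.0):
    """Eq. (4-21): total power of the spectrum A / (1 + (f/f0)**(2m))."""
    return math.pi * A * f0 / (2 * m) / math.sin(math.pi / (2 * m))

def distortion_ratio(m, f0, fs):
    """ASSUMED distortion measure: rms ratio sqrt(difference power / total power)."""
    return math.sqrt(difference_power(m, f0, fs) / total_power(m, f0))

m = 6                      # 36 db per octave, since m = 1 corresponds to 6 db per octave
f0 = 2000.0
for ratio in (2.0, 2.4, 2.5, 3.0):
    print(ratio, round(100 * distortion_ratio(m, f0, ratio * f0), 1), "percent")
```

Under this assumption the measure falls just below 10 percent at a ratio of 2.5 and rises above it at 2.4, which is consistent with the minimum ratio quoted in the text.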

The minimum number of quantization levels necessary to
achieve highly intelligible voice communications over a digital
channel is shown in Fig. 4-2. The maximum theoretical
output S/N ratios obtainable for several quantization levels
are given in this figure. It is assumed that the digital
channel must perform with 100 percent sentence intelligibility.
This is equivalent to approximately 90 percent word
intelligibility. From previous test results of voice communication
channels, it is known that a channel's output speech-to-noise
ratio should be in excess of 15 db to obtain a word
intelligibility score of 90 percent or above (Hirsh, 1969). If one
chooses 8 quantization levels, the maximum output S/N ratio
is approximately 15 db. However, 8 levels give no margin
in output S/N and a WI of 90 percent cannot be assured. A
better choice is 16 levels. Sixteen quantization levels
provide a maximum output S/N of approximately 20 db assuming
the input S/N is in excess of 20 db. In summary, our minimum
calculated parameters for a digital voice channel with 100
percent sentence intelligibility (assuming no uncorrectable
channel degradation) are: (1) ADC input and DAC output
bandpass filters with a cutoff frequency of 2000 Hz and a
rolloff of 36 db per octave, (2) a sample rate of 5000
samples per second, and (3) 16 quantization levels.

Laboratory tests were performed to verify these theo- 
retical parameters (Culver, 1969). The ADC and DAC were 
operated at several sample rates and quantization levels. 
Figure 4-15 shows the results of these tests. Note that the 
previously selected parameters would have resulted in WI 
scores very close to the desired 90 percent. More intelli- 
gible output speech can be provided if the sample rate is 
increased to 6000 samples per second. Increasing the number 
of quantization levels from 16 to 32 at a sample rate of 
5000 samples per second does not give any appreciable im- 
provement. The new lowpass filter cutoff frequency is now 
6000/2.5 = 2400 Hz. Therefore, both experimental and 
calculated results support the conclusion that the minimum 
parameters for 100 percent sentence intelligibility (assuming
no uncorrectable channel degradation) are: (1) ADC input
and DAC output bandpass filters with a cutoff frequency of
2400 Hz and a rolloff of 36 db per octave, (2) a sample
rate of 6000 samples per second, and (3) 16 quantization
levels.

Fig. 4-15. Sample Rate Versus W. I. for Speech Quantized with



Power spectral distribution plots of speech with
quantization noise show that consonant masking becomes
very significant for 8 or fewer quantization levels. Figure
4-16 shows the PSD for the input speech phrase, "Top dog,
it's necessary to show you have heard wasp." Note the
vowel sound formant peaks from 0 to 1000 Hz and the consonant
energy between 1000 and 3000 Hz. When this phrase is
quantized using 32 and 16 levels, the DAC output does not
contain any noticeable quantization noise (see Figs. 4-17
and 4-18). By comparing the two plots for 16 and 8
quantization levels (Figs. 4-18 and 4-19), it can be seen that
the noise has significantly increased the total power in
the frequency range containing the consonant frequencies
for 8 quantization levels. When 4 levels are used (Fig. 4-20),
the energy in the frequency band of 1000 to 3000 Hz is nearly
twice the amount measured when 16 levels were used. This
much noise superimposed on the consonant sounds requires the
listener to recognize words and syllables based on vowel
sounds and context, which usually results in a low probability
of correct message perception.

Fig. 4-16. PSD of ADC Input. Phrase, "Top dog, it is necessary
to show you have heard wasp."

Fig. 4-17. PSD of DAC Output for Speech Quantized
with 32 Levels. Phrase, "Top dog, it is necessary to show you
have heard wasp," sampled at 6700 samples per second.

Fig. 4-18. PSD of DAC Output for Speech Quantized
with 16 Levels. Phrase, "Top dog, it is necessary to show you
have heard wasp," sampled at 6700 samples per second.

Fig. 4-19. PSD of DAC Output for Speech Quantized
with 8 Levels. Phrase, "Top dog, it is necessary to show you
have heard wasp," sampled at 6700 samples per second.

Fig. 4-20. PSD of DAC Output for Speech Quantized
with 4 Levels. Phrase, "Top dog, it is necessary to show you
have heard wasp," sampled at 6700 samples per second.

CHAPTER V


MINIMIZING QUANTIZATION NOISE 

Various techniques to minimize quantization noise in a 
digital voice communication channel are discussed in this 
chapter. The first approach is trivial, namely, do not
quantize the sampled speech with fewer than 16 levels.
However, if the overall channel bandwidth must be minimized, 
there are techniques that can be employed to reduce the 
overall bit rate by reducing the number of quantization 
levels required before quantization noise becomes excessive. 
A common approach is nonlinear quantization. 

Nonlinear quantization employs unequal quantization
steps. The steps are tapered to give fine divisions for
low amplitude signals and coarse divisions for high
amplitude signals (see Fig. 5-1(a)). A complementary circuit
is provided in the DAC to emphasize the high amplitude
signals and de-emphasize the low amplitude signals to make
the overall combination linear. For speech, the input signal
is normally passed through a compression amplifier prior to
digitizing. The output of the DAC is passed through an
expansion network. A typical transfer function curve for
the compression amplifier is shown in Fig. 5-1(b). Low
amplitude consonant sounds are amplified while the high
amplitude vowel sounds are attenuated in this type of
amplifier. When the output of the compression amplifier is
digitized, the consonant sounds receive a weighting more
nearly equal to that given the vowel sounds than they
receive in a linear system. Consequently, when fewer than
16 quantization levels are used to digitize the speech, the
ratio of quantization noise to the desired signal is more
nearly constant for both the consonant and vowel sounds.
This reduces the effect of quantization noise at the
expander output by increasing the ratio of consonant
amplitudes to quantization noise amplitudes. Therefore,
with nonlinear encoding the consonant and vowel sounds are
more equally affected by quantization noise, whereas the
consonant sounds are affected first in a linear encoding
system.

(b) Logarithmic Compression Amplifier Transfer
Function

Fig. 5-1. Analog Nonlinear Encoding
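
The equalizing effect of companding on the ratio of signal
to quantization noise can be illustrated numerically. The
following Python sketch is not taken from the laboratory
equipment. It assumes a mu-law curve with mu = 255 as a
stand-in for the logarithmic compression amplifier, an ideal
8-level uniform quantizer, and sine waves of peak amplitude
0.9 and 0.03 (full scale = 1) as rough proxies for vowel and
consonant sounds. All function names are hypothetical.

```python
import math

MU = 255.0  # assumed compression constant; the thesis does not specify one


def compress(x, mu=MU):
    """Logarithmic (mu-law) compression of a sample in [-1, 1]."""
    return math.copysign(math.log1p(mu * abs(x)) / math.log1p(mu), x)


def expand(y, mu=MU):
    """Inverse of compress(): restores the original amplitude scale."""
    return math.copysign(math.expm1(abs(y) * math.log1p(mu)) / mu, y)


def quantize(x, levels):
    """Uniform quantizer on [-1, 1]; returns the midpoint of the step hit."""
    step = 2.0 / levels
    index = min(int((x + 1.0) / step), levels - 1)
    return -1.0 + (index + 0.5) * step


def snr_db(amplitude, levels, companded, n=2000):
    """Signal-to-quantization-noise ratio for a sine of given peak amplitude."""
    signal_power = 0.0
    noise_power = 0.0
    for k in range(n):
        x = amplitude * math.sin(2.0 * math.pi * 7.0 * k / n)
        if companded:
            y = expand(quantize(compress(x), levels))
        else:
            y = quantize(x, levels)
        signal_power += x * x
        noise_power += (x - y) ** 2
    return 10.0 * math.log10(signal_power / noise_power)


if __name__ == "__main__":
    for amp, label in ((0.9, "high amplitude"), (0.03, "low amplitude")):
        print(f"{label}: linear {snr_db(amp, 8, False):6.1f} db, "
              f"companded {snr_db(amp, 8, True):6.1f} db")
```

Under these assumptions the low amplitude tone is reproduced
with a much higher signal-to-noise ratio when companded, at
the cost of some loss for the high amplitude tone, so the
two ratios end up far closer together than in the linear
case.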

A laboratory experiment was conducted to determine the
effects of analog companding (compression/expansion) of the
speech in conjunction with a linear ADC and DAC. The WI
curves for 32, 16, 8 and 4 quantization levels are shown in
Fig. 5-2. Note that the minimum parameters for achieving
90 percent WI with a linear system, namely, a sample rate
of 6000 samples per second and 16 quantization levels,
produce WI scores in excess of 95 percent when nonlinear
encoding is employed. If nonlinear compression and
expansion speech processing is used at the input of the ADC
and the output of the DAC, the sample rate can be reduced to
5000 samples per second and the number of quantization
levels can be reduced to 8 while still achieving a 90
percent WI score for the channel. The lowpass filter cutoff
frequency will also have to be reduced to 5000/2.5 = 2000 Hz.
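
The bit rate saving implied by these parameters follows from
simple arithmetic, sketched below in Python. The function
names are hypothetical, and the sketch assumes that each
sample is sent as the smallest whole number of bits that can
label every quantization level, with no framing or
synchronization overhead.

```python
import math


def bit_rate(sample_rate, levels):
    """Serial bit rate: samples per second times bits per sample."""
    bits_per_sample = math.ceil(math.log2(levels))
    return sample_rate * bits_per_sample


def cutoff_hz(sample_rate, ratio=2.5):
    """Lowpass cutoff using the sample-rate-to-cutoff ratio of 2.5 from the text."""
    return sample_rate / ratio


if __name__ == "__main__":
    print(bit_rate(6000, 16), cutoff_hz(6000))  # linear encoding
    print(bit_rate(5000, 8), cutoff_hz(5000))   # with analog companding
```

On these assumptions the linear channel requires 24,000 bits
per second and the companded channel 15,000 bits per second.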

Nonlinear quantization can also be achieved by employing
tapered steps in the analog-to-digital conversion process.
Each sample value of the input analog signal is compared to
reference voltages within the ADC to determine the binary
equivalent (± one-half of a quantization step). If these
reference voltage levels were adjusted to give weighted
binary equivalent values to the sample values, the high
amplitude vowel sounds could be weighted downward and the
low amplitude consonant sounds could be weighted upward.
The results would be similar to those for the analog
compression amplifier at the input of a linear ADC. A
nonlinear DAC would have to be matched to the nonlinear ADC
to obtain an overall linear system. Nonlinear ADCs and DACs
are not readily available and must therefore be built
especially for voice transmission. Once an ADC or DAC has
been modified for nonlinear speech encoding or decoding, it
would probably not be suitable for use with any other data.
It is for these reasons that speech, and other data, is
nonlinearly processed in its analog form so that common
linear ADC and DAC equipment can be used.
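
One way to picture the tapered reference voltages of such a
nonlinear ADC is to space the decision levels uniformly on
the compressed scale and map them back through the expansion
curve. The Python sketch below does this under the
assumption of a mu-law curve with mu = 255. The curve, the
constant, and the function names are illustrative choices
and do not describe any particular converter.

```python
import math

MU = 255.0  # assumed compression constant, not given in the thesis


def expand(y, mu=MU):
    """Mu-law expansion curve mapping the compressed scale back to voltage."""
    return math.copysign(math.expm1(abs(y) * math.log1p(mu)) / mu, y)


def tapered_thresholds(levels, full_scale=1.0):
    """Reference voltages: uniform on the compressed scale, tapered in voltage."""
    uniform = [-1.0 + 2.0 * k / levels for k in range(1, levels)]
    return [full_scale * expand(u) for u in uniform]


def encode(voltage, thresholds):
    """Binary equivalent: the number of reference voltages the sample reaches."""
    return sum(1 for t in thresholds if voltage >= t)


if __name__ == "__main__":
    refs = tapered_thresholds(8)
    print([round(r, 4) for r in refs])
    print(encode(0.02, refs), encode(0.5, refs))
```

The reference voltages crowd together near zero and spread
apart toward full scale, which gives the fine divisions for
low amplitude signals and coarse divisions for high
amplitude signals described above.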

Another type of speech processing that can be employed
to reduce the effects of quantization noise is pre-emphasis.
Recall that the consonant sounds are affected first by
quantization noise and that they have the most power in the
frequency band of 2000 to 3000 Hz. Therefore, if we increase
the relative amplitude of the consonant sounds with
frequencies in excess of 1000 Hz with respect to the vowel
sounds below 1000 Hz, quantization noise would begin to
degrade the speech at a lower ratio of rms speech-to-noise
power. The simplest form of a pre-emphasis network is a
capacitor in series with the ADC input. One capacitor will
provide 6 db per octave of pre-emphasis. A more precise
pre-emphasis network is an RC network in series with the
input of the ADC, which sets the frequency at which the
6 db per octave of pre-emphasis begins. For example, if the
RC network is designed for pre-emphasis beginning at 500 Hz,
signals at 1000, 2000 and 4000 Hz would be emphasized by
approximately 6, 12 and 18 db, respectively, with respect to
the 500 Hz signal. A speech purist would apply the DAC
output to a speech de-emphasis network with a negative 6 db
per octave slope to compensate for the unnatural sound of
the speech passed through a pre-emphasis network. However,
in most existing voice communication systems that employ
pre-emphasis and are used by male speakers, de-emphasis is
omitted. The resulting output speech sounds "high-pitched"
but it is very intelligible and can be understood better in
conditions of high ambient noise.
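
The 6, 12 and 18 db figures are straight-line (asymptotic)
values. The Python sketch below compares them with the gain
of an idealized single-zero response H(f) = 1 + jf/f0, which
is assumed here as a model of the RC network. A practical
network also has an upper corner frequency that is ignored.
The function names and the component values in the usage
lines are hypothetical.

```python
import math


def corner_frequency(r_ohms, c_farads):
    """Frequency at which the pre-emphasis of an RC network begins."""
    return 1.0 / (2.0 * math.pi * r_ohms * c_farads)


def asymptotic_gain_db(f, f0):
    """Straight-line approximation: 6 db per octave above f0, flat below."""
    return 20.0 * math.log10(f / f0) if f > f0 else 0.0


def exact_gain_db(f, f0):
    """Gain of the idealized response |1 + jf/f0| relative to low frequencies."""
    return 10.0 * math.log10(1.0 + (f / f0) ** 2)


if __name__ == "__main__":
    f0 = 500.0
    for f in (1000.0, 2000.0, 4000.0):
        relative = exact_gain_db(f, f0) - exact_gain_db(f0, f0)
        print(f"{f:6.0f} Hz: asymptotic {asymptotic_gain_db(f, f0):5.1f} db, "
              f"exact re 500 Hz {relative:5.1f} db")
    # Hypothetical component values giving a corner near 500 Hz.
    print(round(corner_frequency(10e3, 31.8e-9), 1))
```

For this idealized response the exact emphasis relative to
the 500 Hz signal is roughly 4, 9 and 15 db, about 2 to 3 db
below the straight-line values, because the gain at the
corner frequency itself is already about 3 db.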
CHAPTER VI


CONCLUSIONS AND RECOMMENDATIONS 

High sentence intelligibility can be achieved from a
linearly encoded PCM voice channel with a minimum bit rate
of 24,000 bits per second. The minimum parameters for 100
percent sentence intelligibility are: (1) analog-to-digital
and digital-to-analog converter lowpass filters with a 3 db
cutoff frequency of 2400 Hz and a rolloff of 36 db per octave,
(2) a sample rate of 6000 samples per second, and (3) 16
quantization levels. If fewer than 16 levels are used to
quantize the input signal, the quantization noise generated
in the digital-to-analog converter will begin to seriously
degrade the output speech. The first speech components to
be affected by quantization noise are the consonant sounds.

If the low amplitude consonant sounds are amplified with
respect to the high amplitude vowel sounds, more
quantization noise can be tolerated at the output of the
digital-to-analog converter. This can be accomplished by
passing the speech through a logarithmic compression
amplifier before encoding it in the analog-to-digital
converter. The consonant sound weighting is restored at the
output of the digital-to-analog converter by an expansion
amplifier. With nonlinear encoding, the number of
quantization levels can be reduced to 8 and the sample rate
can be reduced to 5000 samples per second. The lowpass
filter cutoff frequency must also be reduced to 2000 Hz.

It is recommended that more work be done to develop a
digital-to-analog converter that is adapted to speech. The
analog signal generation circuits should be designed to
generate sinusoidal functions by straight-line interpolation
instead of by filtering a staircase function.
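
The motivation for this recommendation can be sketched
numerically. The Python fragment below compares the rms
error of an unfiltered staircase (each sample held until the
next) with that of straight-line interpolation between
samples, for a single test tone. The 500 Hz tone, the
6000 samples per second rate, the evaluation grid, and all
function names are assumptions made for illustration. No
lowpass filter or quantizer is modeled.

```python
import math


def reconstruct(samples, upsample, linear):
    """Rebuild a finely spaced waveform from samples by hold or interpolation."""
    out = []
    for i in range(len(samples) - 1):
        for k in range(upsample):
            if linear:
                frac = k / upsample
                out.append(samples[i] + frac * (samples[i + 1] - samples[i]))
            else:
                out.append(samples[i])  # staircase: hold the last sample
    return out


def rms_error(tone_hz, linear, sample_rate=6000, upsample=16, n=600):
    """RMS difference between the rebuilt waveform and the original sine."""
    samples = [math.sin(2.0 * math.pi * tone_hz * i / sample_rate)
               for i in range(n)]
    rebuilt = reconstruct(samples, upsample, linear)
    fine_rate = sample_rate * upsample
    total = 0.0
    for j, y in enumerate(rebuilt):
        x = math.sin(2.0 * math.pi * tone_hz * j / fine_rate)
        total += (x - y) ** 2
    return math.sqrt(total / len(rebuilt))


if __name__ == "__main__":
    print("staircase    :", round(rms_error(500.0, linear=False), 4))
    print("straight line:", round(rms_error(500.0, linear=True), 4))
```

For this test tone the straight-line output lies much closer
to the original waveform than the staircase does, so less
filtering would be needed to remove the remaining distortion
products.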
APPENDIX A


WORD INTELLIGIBILITY SCORING 

The unit, "percent word intelligibility," is a quantity
representing the percentage of words correctly recognized at
the output of a voice channel from a list read into its
input. There are many different WI (word intelligibility)
measurement techniques, each of which biases the output
scores in its own way. All techniques attempt to make a
repeatable measurement independent of the human subject's
emotions and previous training. Single-syllable
(monosyllable) words are usually used to prevent the
selection of the correct word from adjacent syllable context.

The WI measurement technique used to evaluate the
different laboratory test conditions covered in this thesis
was based on the Harvard PB (phonetically balanced) word
technique. Fifty monosyllable PB words were selected for
each word list to cover most of the phonemes of the English
language. Each word was enunciated in a carrier phrase
such as "Top dog, top dog, it is the word ant that you should
record now," for the word "ant." The PB words and their
carrier phrases were recorded and later played into the
analog-to-digital converter of the simulated digital voice
channel to determine the effects of changing specific
parameters. The output of the digital-to-analog converter
was recorded and later scored by subjects. The percentage of
correctly perceived words was the percent WI for the channel.
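
The scoring arithmetic can be sketched in a few lines of
Python. The function name and the short word list are
hypothetical (only "ant" appears in the carrier phrase
example above), and matching is assumed to ignore letter
case and surrounding spaces. The actual scoring procedure
used by the subjects is not described in more detail than
given above.

```python
def percent_wi(spoken, heard):
    """Percentage of list words that the subject wrote down correctly."""
    if not spoken or len(spoken) != len(heard):
        raise ValueError("word lists must be non-empty and of equal length")
    correct = sum(
        1 for s, h in zip(spoken, heard)
        if s.strip().lower() == h.strip().lower()
    )
    return 100.0 * correct / len(spoken)


if __name__ == "__main__":
    spoken = ["ant", "dog", "wasp", "top"]  # hypothetical word list
    heard = ["ant", "dot", "wasp", "top"]   # hypothetical subject responses
    print(percent_wi(spoken, heard))
```

With a full list of fifty PB words, each word counts for two
percentage points of the WI score.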

A WI score can be related to the percentage of sentences 
that could have been interpreted over the voice link under 
test. Since sentences contain a large amount of redundant 
information, the sentence intelligibility score will be 
greater than the word intelligibility score for each test 
condition. Word intelligibility scores account for the same 
factors that degrade voice channels used for conversational 
speech. Channel noise and distortion that would degrade a 
voice channel's quality will also decrease the WI score.
However, WI measurements will not normally account for
sounds such as echoes or tones which are annoying but do not
degrade the channel's intelligibility. The WI measurement
technique used for the tests in this thesis was repeatable
and showed sufficient sensitivity to changes in the test 
parameter variables to allow optimization of the digital 
voice channel's performance. 